ABSTRACT

Competency-based teacher education has been defined in various ways,
but there is general agreement on at least two basic elements. The
first essential characteristic is the specification of teacher
competencies which form the basis of the entire program. The second is
the design of assessment techniques directly related to the specific
competencies. Competencies have been written in a variety of ways and
have been related to various domains or competency areas. In each of
the competency domains the form of the competency must be examined to
determine appropriate assessment techniques. There are a number of
assessment factors which need to be considered in the evaluation of
competencies. The nature of the standards, or criterion selection, is
essential. Other concerns are comprehensiveness and fidelity of the
assessment system, validity and reliability of data, and general
utility of the process. Assessment of knowledge competencies can be
accomplished through paper and pencil testing. Assessment of teaching
behaviors or performances, however, requires observation of the
individual demonstrating the skill. This may be accomplished by rating
scales or structured observation systems. Sampling and student
achievement measures have also been used, although it has been
concluded that student learning measures cannot be fairly used to
evaluate individual teachers at present. (An extensive list of
references is included.) (RC)

THE NATURE OF AND ALTERNATIVES
FOR TEACHER COMPETENCY STATEMENTS AND
IMPLICATIONS FOR ASSESSMENT TECHNIQUES

Competency-based teacher education is perhaps the most frequently
discussed topic in education today. Close to 500 teacher education
institutions (Sherwin, 1973, p. 3) and over 35 states (Roth, 1974)
have become involved in either studying or developing such programs.
Competency-based teacher education has been defined in various
ways, but there is general agreement on at least two basic elements.
The first essential characteristic is the specification of teacher
competencies which form the basis of the entire program. The second
is the design of assessment techniques, directly related to the
specified competencies, which are necessary in order to determine
whether or not a student has achieved the competencies.



Competency Domains

In view of the critical role of these competencies it is important to
review the nature of competency statements and their implications
for assessment. Competencies have been written in a variety of ways
and have been related to various domains or competency areas. The
competency domains identified in the literature are knowledge, behaviors,
affect, consequences, and experiences. Each of these needs to be
examined to determine implications for possible assessment strategies.

Knowledge domain competencies refer to information and cognitive
processes necessary for effective instruction and related activities.
These include knowledge of a subject area, planning for instruction,
instructional strategies, child growth and development, human relations,
etc. Knowledge in these areas deals with facts, processes, theories,
and techniques. The scope of the knowledge competencies will depend
upon what areas of the teacher education program (content area, liberal
arts, professional education) are included in the competency-based
program. Examples from various knowledge areas would be an ability to
balance chemical equations, write behavioral objectives, identify a
variety of instructional techniques, describe Piaget's stages of
development, and relate counseling techniques appropriate to given
situations (the specificity of these competencies will be discussed in
a later section). These are usually evaluated by paper and pencil
processes such as those utilized in current traditional teacher education
programs.

Some educators have referred to an area of competence which usually
is considered as being either in the area of knowledge or performance,
and may belong somewhere between the two. These competencies have
been identified as "outputs" and are described in the statement which
follows:

     Teachers produce a variety of outputs which can be
     categorized as either Products, Events, or Conditions.
     Included among these categories of outputs are the
     following:

     A. Products - A product is a tangible, concrete, transportable
        outcome of work effort.

           Instructional units
           Lesson plans
           Lists of objectives
           Guides, outlines, sets of directions
           Bulletin boards

     B. Events - An event represents an instance of occurrence
        of an observable transaction or set of behaviors.

           Class discussion
           Demonstration
           Presentation
           Field trip

     C. Conditions - A condition represents an instance of
        a desired circumstance expected to endure and to
        influence a program.

           Parent acceptance of school program
           Classroom climate
           School atmosphere
           Working relationships with other teachers

     (Morse, Smith, and Thomas, 1972, pp. 11-12)

The behavior domain refers to the performance competencies an
individual demonstrates. These are the actual teaching acts considered
necessary in order to enable students to learn. The performance of
teaching skills is based on the previously acquired knowledge
competencies, but requires a demonstration that the student can
perform and utilize various strategies and techniques. Examples here
include demonstration of a variety of questioning skills, introduction
of a lesson, guiding students in discovery activities, etc.

The affective domain has been identified in the literature as the opinions,
attitudes, emotions, and dispositions of the teacher. This covers a variety
of specific factors such as sensitivity to needs of students, self-acceptance,
professionalism, etc. Human relations training labs and interaction
laboratories have been established to accomplish these competencies.
It is important to note, however, that we may wish to distinguish between
affective competencies, such as accepting student feelings, which
are expected to be demonstrated in the classroom, and personality
variables of the trainee, such as emotional security, which are more
difficult to elicit and evaluate.

Consequence domain objectives relate to the influence the teacher has
on pupils. In these competencies the criterion considered is the
product; i.e., the behaviors or attitude and achievement gains of the
pupils being instructed.

The consequence area, however, can be separated into at least two
distinct categories, student behaviors and student learning. Student
behaviors refer to those activities students engage in which are
assumed necessary to attain the educational objectives. Some programs
are placing a great deal of emphasis on evaluating this dimension of
teacher competence. Examples of student activities include the
following:

1) students being supportive and cooperative

2) students being attentive to class activities

3) students participating in verbal interaction

4) students following specific activities to completion

5) students using media and resources for study
   (Hatfield, 1974, pp. 41, 42)

An example of the second type, a pupil achievement consequence
objective, is

     Given fifth grade pupils who have not mastered their
     multiplication facts, the pupils will be able to master
     all the facts (1-10) x (1-10) and be able to complete
     them on a paper and pencil test at a rate of 30 per
     minute. The criterion is 90% accuracy by at least two
     out of three pupils within four weeks.
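
The two-level criterion in this objective (an accuracy threshold for each pupil, then a required share of pupils reaching it) can be sketched in code. The following is a minimal illustration only; the class and function names are hypothetical, and the 30-per-minute rate and the four-week window are left out for brevity.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class PupilResult:
    """One pupil's score on the timed multiplication test (hypothetical record)."""
    correct: int    # facts answered correctly
    attempted: int  # facts attempted


def pupil_meets_criterion(result, min_accuracy=Fraction(9, 10)):
    """True if the pupil reaches the per-pupil accuracy threshold (90% here)."""
    if result.attempted == 0:
        return False
    return Fraction(result.correct, result.attempted) >= min_accuracy


def objective_attained(results, min_share=Fraction(2, 3)):
    """True if at least `min_share` of the pupils meet the accuracy threshold."""
    if not results:
        return False
    passing = sum(pupil_meets_criterion(r) for r in results)
    return Fraction(passing, len(results)) >= min_share


# Hypothetical group of three pupils: two reach 90% accuracy, one does not.
group = [PupilResult(95, 100), PupilResult(90, 100), PupilResult(60, 100)]
print(objective_attained(group))
```

In this made-up group two of the three pupils reach 90% accuracy, so the objective would count as attained. Exact fractions are used so that borderline cases such as exactly two out of three are not affected by rounding.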

Experience or expressive domain objectives have been described as
activities an individual engages in which are outcomes in themselves.
There are no specified outcomes which are to occur as a result of the
experience; the objective is complete once the individual has experienced
the activity. An example is "the student will read a story to a
kindergarten child while holding the child on his lap," or "the student
will visit the home of each of his pupils" (Weber, 1970).



Competency Forms

There seems to be a variety of viewpoints as to how competencies
should be written. One approach is to write them as general statements
of behavior with some broadly defined, expected level of achievement.
An example of this approach is "the teacher is able to use a variety
of teaching techniques, selecting those which are appropriate in
particular situations." Note that the competency is general enough
to cover a number of specific behaviors. Also, the standard of
achievement, "appropriate," is not very specific and provides for a more
subjective evaluation. These are high inference types of competencies.

Merwin (1973), however, argues that PBTE is supposed to differ from
current teacher education programs by the explicitness with which
the competencies and the criteria used in assessing their mastery
are stated. Further, this explicitness should leave little or no
ambiguity regarding procedures for assessing the performance or in
arriving at a decision as to whether or not the individual possesses it.

In addition, Morse, et al. (1972), believe that evaluation goes
beyond measurement of performance. Judgments have to be made
in relation to those factors which give meaning to the performance
information produced. Central to this judgmental process is a clear
delineation of what it is the assessment is to assess.

Pursuing this line of thinking, another approach would be to develop
specific performance objectives derived from the competency statement.
These specific performance objectives are behaviors which must be
demonstrated as evidence that one has attained the generic competency
from which they were derived. In this situation, the evaluation focuses
on the demonstration of the more specific behaviors, and achievement
of the competency is determined by whether or not most or all of the
specific performances were demonstrated. This is a lower inference
type of objective and is somewhat less subjective in nature. An
example is

     competency:    The teacher trainee is able to use a
                    variety of teaching techniques.

     performance    The teacher trainee will demonstrate
     objectives:    ability to give a lecture by stating objectives
                    clearly, using an audible voice, varying
                    the pace, establishing eye contact, and
                    summarizing key points.

                    The teacher trainee will demonstrate
                    ability to conduct a group discussion
                    by defining the topic, involving all
                    students, summarizing key points, . . .

                    The teacher trainee will demonstrate
                    ability to employ oral questioning. . . .

                    The teacher trainee will demonstrate
                    ability to give a demonstration . . . etc.
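
The decision rule described for this form of competency, in which the competency is credited when most or all of its derived performance objectives are demonstrated, might be expressed as follows. This is only a sketch: the function name and the 0.75 share standing in for "most" are assumptions, not values given in the text.

```python
def competency_attained(demonstrated, required_share=0.75):
    """Return True if enough performance objectives were demonstrated.

    `demonstrated` maps each performance objective to True or False.
    `required_share` is an assumed stand-in for "most or all".
    """
    if not demonstrated:
        return False
    share = sum(demonstrated.values()) / len(demonstrated)
    return share >= required_share


# Hypothetical observation record for one teacher trainee.
record = {
    "give a lecture": True,
    "conduct a group discussion": True,
    "employ oral questioning": True,
    "give a demonstration": False,
}
print(competency_attained(record))       # three of four demonstrated
print(competency_attained(record, 1.0))  # stricter rule: all four required
```

With three of four objectives demonstrated, the trainee passes under the assumed "most" rule but not when every objective is required, which shows how much the outcome depends on where a program sets that share.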




Competency statements may also be written as behavioral objectives.
This is the type of competency statement most frequently believed to
be associated with competency-based programs. In this approach,
the behavior, mastery level, and conditions are specifically stated,
with the criterion levels stated as frequencies, percent accuracy, or
other such measures. In this approach, competency statements can
be used directly as assessment criteria. Examples of behavioral
objectives are

     Given examples of classroom management techniques
     (written descriptions or videotaped) the teacher trainee
     will identify by name at least five of six correctly.

     Given a small group of students in a microteaching
     session the trainee will ask one knowledge, one
     application, and one synthesis type question as
     developed in his lesson plan within a twenty-five
     minute lesson.

In each of the competency domains cited, the form of the competency
statement must be examined to determine appropriate assessment
techniques needed to evaluate achievement. It should be noted that
the assessment strategies are affected by a variety of variables related
to competency statements. As each of the competency areas is
examined in the following pages, variables such as context and
specificity will be considered as they relate to the particular competency
domain under discussion.



Assessment Factors

In order to determine implications and problems of assessment of
competencies, an analysis of the literature was conducted to determine
assessment practices and concerns. Remaining sections of this paper
reflect these findings.

There are a number of factors relating to assessment of competencies
in general. One such concern is the evaluation context. For example,
if the individual is required to demonstrate that he has a particular
skill, he might accomplish this by teaching to one or two peers, a
small group of students, or an entire class. In each instance, he is
demonstrating that he can perform the skill, and each of these
alternative contexts may be appropriate.

In some cases, however, the competency may require that the individual
not only be able to demonstrate a particular skill, but that he utilize
this skill at the appropriate time or at a designated frequency as part
of his normal teaching style over a period of time. This requires not
only that the individual "can do" but "does do." This type of competency
requires the classroom as a context for evaluation, as well as a longer
period of time for observation.

The nature of the competency statement clearly has implications for
the context required. On the other hand, the context in which
assessment takes place has a direct bearing on the nature of the outcomes
and the data collected in the assessment process. Context variables
need to be considered when evaluating competencies, and some
standardization is necessary (when possible) in order to make comparable
evaluations.

As an example of this relationship, with particular reference to
performance standards, Schalock, et al. (1974) discussed the competency
"defining the objectives of instruction." They point out that there is
nothing inherent in this competency that is addressed to the quality
expected (standard) nor is there any reference to the context in which
performance is to take place. Also

     Because of this interdependency of competency descriptor,
     the context in which competency is to be demonstrated,
     and the performance standard set for its demonstration,
     the task of becoming clear as to what the assessment system
     was to do and how it was to do it was more difficult than
     anticipated (Schalock, Garrison, and Kersh, 1974).

Although the setting of standards is a key element in the design of
assessment, it is not an easy task.

     Obviously, there is no source, other than judgment,
     to which one can refer to select appropriate standards.
     The question of standards is one which plagues all evaluation
     efforts. However, the nature of the competency, its
     relevance to instruction, its suspected impact on classroom
     learning and other such considerations should be
     weighed in setting the standard (Airasian, 1974, p. 16).

Some assistance in the criteria selection process is provided in the
following statement. It should be noted, however, that this was
written in terms of assessing teachers in general rather than assessing
specific competencies.

     Six Attributes for Discriminating Among Criterion Measures

     1) Differentiates among teachers. There are decisions
        where we do not have enough knowledge merely by
        knowing that a teacher has met a minimal level of
        proficiency. Both administrators and researchers, for
        instance, often encounter situations where they need
        a measure sensitive enough to assess variance in
        teachers' skills.

     2) Assesses learner growth. . . . emphasize the necessity
        to produce criterion measures which can be used to
        assess the results of instructional process, not merely
        the process itself. In certain limited instances we may
        not be interested in the outcomes of instruction as
        reflected by modifications in the learner, but these
        would be few in number. Certain classes of criterion
        measures are notoriously deficient with respect to
        this attribute.

     3) Yields data uncontaminated by required inferences. An
        attribute of considerable importance is whether a
        measure permits the acquisition of data with a minimum
        of required extrapolation on the part of the user. If all
        observations are made in such a way that beyond human
        frailty they have not been forced through a distorting
        inferential sieve, then the measure is better. A classroom
        observation system which asked the user to record
        the raw frequency of teacher questions would possess the
        attribute more so than a system which asked the user to
        judge the warmth of teacher questions.

     4) Adapts to teachers' goal preferences. A measure of
        teaching skill will be more useful for given situations
        if it can adapt to such dissimilarities in goal preferences.

     5) Presents equivalent stimulus situations. There are times
        when we might like to use a measure which would permit
        the measurement of teaching proficiency when the
        stimulus situations were identical or at least comparable.

     6) Contains heuristic data categories. In a sense this
        final attribute is the reverse of attribute number three
        above which focused on the collection of data uncontaminated
        by required inferences. At times we want data that simply
        state what was seen and heard in the classroom. At other
        times it would be useful to gather information -- interpretations --
        which illuminate the nature of the instructional tactics.
        For the unsophisticated individual, in particular, measures
        which would at least in part organize his perceptions regarding
        strengths and weaknesses in teaching would in certain
        situations be most useful (McNeil and Popham, 1973, pp.
        238-239).

In selecting assessment procedures, the influence of the nature of the
competency statement has been stressed. Some general
points to consider in designing assessment are:

     A. Objective instruments development vs. subjective.

     B. Effect of assessment on process.

     C. Selection of acceptable indices.

     D. Establishing validity and reliability (Baird and Yorke, 1971, p. 5)

The validity and reliability of measurement instruments are, of course,
important considerations. These will be discussed at length in appropriate
sections, and therefore only briefly here. The evaluator should decide
which of the validations are pertinent to the instrument being used.
According to Young (1973), face validity is the most common form, and is
concerned with the instrument agreeing with the mode of responding
(written or verbal), whether it measures process or product, and the level of
responding (memory, conceptualization, etc.). Also, content validity
may be estimated by expert ratings of each item, construct validity is
estimated by giving the evaluation to a group of persons possessing the
trait and to a group not possessing the trait, and predictive validity is
determined between levels of evaluations or in different time periods.

Some general recommendations regarding procedures are provided by the
following:

     Actual data gathering techniques to evaluate knowledge and
     practice competencies are not complex. For knowledge
     competencies paper and pencil tests, oral examinations, and
     the like are appropriate. For practice competencies, studies
     of performance in classroom, microteaching, or other similar
     situations can be evaluated by one, or preferably more,
     judges on the basis of checklists, or overall performance
     (Airasian, 1974, p. 17).

     Stimulus and response modes could be specified from a number
     of available alternatives. Prospective teachers might respond
     to paper and pencil directions in a videotaped mode, or vice versa.
     Films, audiotapes and actual demonstrations are other possibilities.
     Responses need not be limited to overt teacher behavior as the only
     teacher "product" but could include products such as lesson
     plans, teacher-made tests, reports to parents, and other
     record-keeping and planning outcomes. And of course, one stimulus
     may produce a series of responses in various modes (Kay, 1974).
An overview of some possible techniques has been developed in terms
of criteria, comprehensiveness, and fidelity. Fidelity refers to the
degree of realism of the test compared to the criterion situation.

[Figure: Fidelity and Comprehensiveness of Different Types of Tests
(Quirk, 1974)]

Another general concern, no matter what the competency domain or
assessment technique, is that of utility. This asks of each data
gathering effort whether the costs of time, money, and effort can be
justified by the extent to which they reduce risk for decision makers.
According to Merwin (1973) there are two ways to apply this criterion.
One is to ask the extent to which the added information provided has
reduced risks in selecting among alternatives, and the second involves
comparing the costs of this particular means for getting the information
with costs in using another means to the same information or equally
predictive information highly correlated with it (e.g., indirect vs.
direct assessment).



One means of viewing competencies and their assessment has been
developed by Turner and should be mentioned at this point. His six
criterion levels for evaluation are as follows:

     Criterion Level 1. At the highest level, the criterion
     against which teachers (or teaching) might be appraised
     consists of two parts. The first part is observation of
     the acts or behaviors in which the teacher engages in
     the classroom. The observations must be conducted
     with a set of instruments which permit classification
     of teacher behaviors in both the cognitive and affective
     domains. The second part is systematic analysis of the
     level of outcomes achieved by the teacher with the pupils
     he teaches. Outcomes in both the cognitive and affective
     domains must be included. Because of variation in the
     entry behaviors of students and variations in teaching
     contexts, the residual outcomes in pupil behavior (the
     terminal behaviors corrected for entry behaviors and moderating
     variables) should be used as the criterion measures. To
     be placed at criterion level 1, the above two-part appraisal
     of teacher performance must be conducted over a relatively
     long period of time, probably at least two years (on a time
     sampling basis) with both the observational and residual
     pupil behavior components assessed during each of the
     years. The reason for the two-year period is that both teacher
     and pupil behavior are open to some random fluctuation and
     care must be taken to obtain a sufficient sample of behavior
     from both sources to assure fair conclusions.

     Criterion Level 2. This criterion level is identical to
     criterion level 1 except that a shorter performance period
     is involved.

     Criterion Level 3. This criterion level differs from criterion
     levels 1 and 2 in that pupil performance data are eliminated
     from the criterion. Judgments about competence or proficiency
     are thus based on the observable behaviors of the teacher
     rather than on the pupil outcomes associated with these
     behaviors.

     Criterion Level 4. This criterion level differs from criterion
     level 3 in that both the teaching context and the range of
     teacher behavior observed are restricted. The context might
     be a typical microteaching context involving a few pupils or
     even peers acting as students. The teacher behavior observed
     would be restricted to a few categories in the cognitive or in
     the affective domain.

     Criterion Level 5. This criterion level differs from criterion
     level 4 in that the teacher need not perform before live
     students (simulated students would be satisfactory). He
     must, however, be able to produce or show in his behavior
     at least one teaching skill; e.g., probing.

     Criterion Level 6. This level differs from criterion level 5
     in that the teacher need not engage in producing a performance,
     but rather, only show that he understands some behavior,
     concept, or principle germane to teaching (Turner, 1972, p. 3).

The relationship between these levels and assessment techniques will
be identified at appropriate points throughout this paper.
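
Criterion Level 1 calls for "residual outcomes," that is, terminal pupil behaviors corrected for entry behaviors. One common way to make such a correction is to regress end-of-period scores on entry scores and keep what the regression does not explain. The sketch below assumes that approach; it is not a procedure Turner specifies, the function name is hypothetical, and moderating variables are ignored.

```python
def residual_outcomes(pre_scores, post_scores):
    """Post-test scores with the linear effect of pre-test scores removed.

    Fits post = intercept + slope * pre by ordinary least squares and
    returns each pupil's observed score minus the fitted score.
    """
    n = len(pre_scores)
    if n != len(post_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least two pupils")
    mean_pre = sum(pre_scores) / n
    mean_post = sum(post_scores) / n
    sxx = sum((x - mean_pre) ** 2 for x in pre_scores)
    if sxx == 0:
        raise ValueError("pre-test scores must vary")
    sxy = sum((x - mean_pre) * (y - mean_post)
              for x, y in zip(pre_scores, post_scores))
    slope = sxy / sxx
    intercept = mean_post - slope * mean_pre
    return [y - (intercept + slope * x)
            for x, y in zip(pre_scores, post_scores)]


# Hypothetical entry and terminal scores for four pupils.
print(residual_outcomes([10, 20, 30, 40], [12, 30, 33, 50]))
```

A positive residual means a pupil finished above what the entry score alone would predict. Averaging residuals over a teacher's pupils is one possible summary, though, as the criterion level itself notes, such figures fluctuate and need repeated sampling.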

A general overview of this relationship is provided by the following
chart:



WHEN             WHAT                                   HOW

pre-practicum    Level 6 -- Trainee shows that he       paper and pencil
                 understands some behaviors,            tests; interviews
                 concepts, or principles germane to
                 teaching, usually in a paper and
                 pencil exercise.

pre-practicum    Level 5 -- Trainee demonstrates his    case studies;
                 possession of teaching "skills";       simulation
                 however, he need not do so with
                 students. He may interact with
                 case studies or other simulated
                 materials.

pre-practicum    Level 4 -- Trainee demonstrates        microteaching;
                 teaching behaviors in a                interaction analysis
                 microteaching context with a few
                 students or peers.

practicum        Level 3 -- Trainee is judged on the    videotape; observation
                 basis of his ability to                forms; questioning
                 demonstrate "teaching behaviors"       pupils; interaction
                 in the classroom.                      analysis

practicum and    Level 2 -- Short-range outcomes        all tools used to assess
on the job       achieved by the trainee with           public school pupils'
                 the pupils he teaches.                 growth (including above)

on the job       Level 1 -- Long-range outcomes         all tools used to assess
                 achieved by the trainee (now a         public school pupils'
                 certified teacher) with the pupils     growth (including above)
                 he teaches.

(Baird and Yorke, 1971, p. 7)

Knowledge Assessment

Assessment of knowledge competencies generally can be accomplished
through paper and pencil testing. This can easily be done in the preservice
college classroom, requiring very little in the way of special settings,
instrumentation, or techniques. In addition, there are other ways of
assessing knowledge, such as mediated stimulus-response techniques.

As an example, Okey and Humphreys (1974) suggest audio recordings
of classroom discussions used to teach and assess the skill of identifying
different types of teacher questions. Also, they suggest videotaping a
classroom to teach and assess the ability to use reinforcement.

In another example, Popham (1974, p. 54) suggests alternative assessment
approaches for the competency statement "Teachers must be able to both
select and generate defensible instructional objectives." One procedure
requires teachers to generate a set of measurable objectives, then have
these judged by others using criteria of significance, suitability for
learners, etc. Also, a teacher could select a specified number of
objectives from a larger pool, and these could be judged according to
established criteria. Popham further suggests that the teacher could
describe, in an exam-type setting, alternative procedures for selecting
and generating defensible objectives.

The knowledge category, you may recall, refers to facts, processes,
theories, techniques, etc., encompassing a variety of cognitive processes.
It has been noted by Dziuban and Esler (1974) that many learning tasks
are inherently complex because of the interaction of their components and
thus do not lend themselves to being dissected into very small parts.
In structuring a laboratory problem, for instance, a student may have
wide latitude in formulating hypotheses, structuring experimental
procedures, and interpreting data.

Of all the assessment areas, the knowledge area is perhaps the most
developed.

     For three-quarters of a century, decision makers of one
     kind or another have wanted to assess what candidates for
     teaching positions know. Measurement technology for
     assessing academic knowledge thus became highly
     developed. Consequently, we now have widely available
     tests of knowledge of subject matter and of knowledge about
     teaching methods (McDonald, 1974, p. 23).

In spite of this prodigious effort and its advanced status, there are
a number of problems to consider, particularly when developing
assessment for instructional units in teacher education programs.
Since each module or course has its own objectives, existing tests
of knowledge may not be applicable.

Also, in a program that has specific objectives and mastery levels, as
competency-based programs purportedly do (particularly in the knowledge
domain), the assessment is related to the specific objectives. Its
purpose is to determine whether an individual has attained mastery of
the objective as specified by a criterion level, not how he compares
with a group of peers. This requires criterion-referenced testing.
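
The contrast drawn here can be made concrete with a small sketch. A criterion-referenced decision looks only at the candidate's score against a fixed mastery level, while a norm-referenced report depends on how peers performed. The function names, the 80% mastery level, and the scores below are all illustrative assumptions.

```python
def criterion_referenced(score, max_score, mastery_level=0.80):
    """Mastery decision: compare the candidate with a fixed criterion level."""
    return score / max_score >= mastery_level


def norm_referenced(score, peer_scores):
    """Percentile rank: percentage of peers scoring below the candidate."""
    below = sum(1 for s in peer_scores if s < score)
    return 100 * below / len(peer_scores)


# Hypothetical scores on a 20-item test.
peers = [12, 14, 15, 17, 18]
print(criterion_referenced(15, 20))  # depends only on the criterion level
print(norm_referenced(15, peers))    # depends on who else was tested
```

The same score of 15 falls short of the assumed mastery level regardless of the group, whereas its percentile rank would change if the peer group changed.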

In shifting to criterion-referenced testing, however, one encounters
a problem in applying traditional psychometric characteristics of
tests. Definitions of these characteristics, such as reliability and
validity, involve assumptions not consistent with criterion-referenced
tests.

     Many of these definitions involve equality of form and
     content among items as well as considerations of equivalent
     item difficulty. These characteristics produce instruments
     of extreme homogeneity and low variance. Additionally,
     criterion referenced tests derive their meaning from the
     relationships they describe between the items and predetermined
     criteria (Dziuban and Esler, 1974, p. 4).

Another previously mentioned problem inherent in competency-based
programs is the need to establish mastery levels for each of the
competencies. There are several factors to consider in this process.
Quirk (1972) states a number of cautions in using criterion scores,
indicating they should take into consideration the number of test items
per objective, the level of difficulty of these items, and a statement
of the minimum performance level. Quirk also cites three factors
related to setting cutoff scores to indicate "mastery" including
1) standard error of measurement, 2) the "x-percent correct" phenomenon,
and 3) the multiple cutoff model. Quirk notes that a test with low
reliability would have a very large error of measurement in trying to estimate
a score that represents "mastery." In referring to the "x-percent correct"
phenomenon, Quirk states that the percent of items that any given
candidate answers correctly depends on the content of the items
and the difficulty level of the items in the test as well as his
personal state during the test. If alternate forms of the test are to be
used, the forms need to be equated statistically.
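
The link between low reliability and a large error of measurement can be illustrated with the classical test theory estimate, SEM = SD * sqrt(1 - reliability). The following is a minimal sketch under that assumption; the formula is the standard classical one and is not quoted from Quirk, and the function names, the 20-item test, the score standard deviation, and the cutoff are all hypothetical values chosen for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    return score_sd * math.sqrt(1.0 - reliability)


def uncertainty_band(cutoff, score_sd, reliability, z=1.96):
    """Scores within z standard errors of the mastery cutoff."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return cutoff - z * sem, cutoff + z * sem


if __name__ == "__main__":
    # Hypothetical 20-item test: score SD of 4 points, mastery cutoff of 16.
    for reliability in (0.90, 0.60):
        low, high = uncertainty_band(16, 4.0, reliability)
        print(f"reliability={reliability:.2f}: band {low:.1f} to {high:.1f}")
```

The band widens as reliability falls, so more candidates near the cutoff cannot be classified as "mastery" or "non-mastery" with any confidence.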

Some consideration has been given to describing a teacher candidate's
overall ability by developing a competency profile, with competencies
along the horizontal axis and degree of achievement along the vertical
axis of a graph. It has been suggested that such an approach would
assist employers in identifying better qualified teachers and those
with skills which are particularly suited to their schools. This is a
type of multiple cutoff or parallel stalk model as referred to by Quirk,
and he has expressed some concerns. For example, if a candidate were
to perform better on one objective than another, and the two objectives
were highly correlated, the reliability of the difference scores would
be quite low, even if the reliability of both measures were high. This
same concern applies in evaluating the performance of the same candidate
on two different objectives, or on the retesting of the same objective.
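
The concern about difference scores can be made concrete with the classical formula for the reliability of a difference between two measures. The sketch below assumes the two measures have equal variances, in which case the reliability of the difference is ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy). This is the standard textbook form rather than a formula given by Quirk, and the function name and the numerical values are hypothetical.

```python
def difference_score_reliability(rel_x, rel_y, corr_xy):
    """Reliability of X - Y under the equal-variance assumption.

    rel_x, rel_y: reliabilities of the two measures.
    corr_xy: correlation between the two measures.
    """
    if corr_xy >= 1.0:
        raise ValueError("corr_xy must be below 1 for the formula to be defined")
    return ((rel_x + rel_y) / 2.0 - corr_xy) / (1.0 - corr_xy)


if __name__ == "__main__":
    # Two objectives, each measured with reliability 0.90 (hypothetical).
    for corr in (0.0, 0.5, 0.8):
        value = difference_score_reliability(0.90, 0.90, corr)
        print(f"correlation={corr:.1f}: difference reliability={value:.2f}")
```

With both measures at 0.90, a correlation of 0.8 between the objectives cuts the reliability of the difference to about one half, which is the situation the paragraph above describes. The same arithmetic applies to pre and post test gains.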

Also, according to Hills (1971), such scores can be set arbitrarily
without adequate evidence on the validity of the variable that is being
used for selection, as well as the validity of the available measures.

An additional concern in the area of reliability relates to retesting
(this concern was cited earlier). Some competency-based programs are
achievement rather than time based. Students progress as they complete
competencies only, not by accomplishing as much as they can in a
course restricted by time. Students are allowed to be retested until they
achieve mastery. Also, for modules, pre and post tests are provided on
each objective. Such situations require an examination of the reliability
of the difference score. The reliability of this difference score is
likely to be quite low.

Another set of considerations relates to behavioral objectives. In
citing the long lists and number of behavioral objectives in
competency-based programs, Quirk (1974) states the main measurement
problem to be the reliability of the individual measures. Dividing the
performance of a prospective teacher into finer elements could produce an
unsatisfactory reliability figure. Also, according to Dziuban and Esler
(1974), practical considerations often dictate testing competencies which
are only indirectly related to the true goals of the behavioral objectives.
This same discrepancy, however, has long been noted in norm-referenced
instruments.

Although Quirk has offered several criticisms of criterion-based
testing in competency-based programs, disagreement with his
arguments is also found in the literature. Cox (1974), for example,
argues that many of the traditional measurement principles, such as
the standard error of measurement, the reliability of the difference
scores, and predictive validity, have been developed for
norm-referenced tests and are probably not applicable to
criterion-referenced measurement. A number of psychometricians have
studied criterion-referenced test reliability (e.g., Livingston, 1974;
Carver, 1974; Hambleton and Novick, 1973) and offered their analyses.
According to Haladyna (1974) "each differs, and each suffers from a
paucity of empirical studies either confirming or disconfirming the
respective approaches."



Teacher Outputs

Previously in this paper teacher outputs were identified as possibly
being a unique group of teacher competencies as opposed to being
classified under the knowledge or performance category. A rationale
for considering this area of teacher competence and its implications
for measurement are provided by Morse, Smith, and Thomas (1972).
Outputs, as they define them, represent primary, observable dimensions
of teacher productivity, and serve as a bridge for connecting teacher
behavior with learner outcomes.

As a result of the performance of tasks various outputs
will be produced. The outputs teachers produce are
achievements for which they can be held directly
accountable. They are defined as the sole means by which
teachers perform their responsibilities toward learners.
Teachers can control the outputs they produce, they can
also predict with varying degrees of accuracy the effects
their outputs are likely to have on learners. By distinguishing
between teacher outputs and learner outcomes, one can
give substance to the technical outcomes of teaching
behaviors. This procedure emerges from and is consistent
with the position that in order to nurture certain learner
outcomes the teacher must do something. The things
done include systematically using or developing materials,
providing various experiences, and creating various
climates or conditions thought to be conducive to learning.
To that extent, it is these things, i.e., outputs, for which
we can hold teaching behavior responsible or accountable.
The teacher's responsibility includes assuring the relevance
of those outputs to meeting the individual and collective needs
of pupils.

The kind, quantity, and quality of outputs that teachers
produce can be measured. These measures constitute the
basic data to be collected in any effort at assessing
competence. The eventual linking of this data to data
gathered about learner outcomes could provide a rich base
of information from which to draw in making judgments about
the competence of teachers (Morse, Smith, and Thomas, 1972, p. 11).



Performance Assessment

Teaching behaviors or performances require observation of the individual
demonstrating the skill. This may be done by personal observation or use
of recording equipment, with or without the utilization of systematic observation
scales. Why evaluate teaching performance? Why not deal with the ultimate
criterion of effectiveness, pupil learning? Much more will be written on
this in the section on pupil achievement; however, the following rationale
has been noted in the literature.

Measuring teacher effectiveness by measuring change in
pupils is probably only feasible for simpler, lower level
objectives. . . .

For the attainment of higher level objectives, or more slowly
developing objectives, the more appropriate procedure appears
to be to measure the behavior of the teacher and compare it to
behavior which is thought to be related to the development of
higher level objectives in pupils. Such a procedure appears
feasible, both for the assessment of competence of individual
teachers and for the certification of programs (Soar, 1973, p. 210).

Similarly, the teacher appears to be more fairly evaluated if
the judgment is made on what he does, rather than on the
outcome of what he does. The first is under his control and the
second is not (or at least not nearly so much so) (Soar, 1973, p. 209).

In reference to Turner's criteria, Merwin (1973, p. 12) notes that Turner's
lower criterion levels involve assessing teaching behavior which is supposed
to bring about a desired change in pupil behavior. He argues, however,
that such a substitution can only be justified on the basis of a demonstrated
reliable relationship between the assessed teacher behavior and change in
pupil behavior that would be measured using the direct assessment approach.
Currently, both traditional and competency-based programs must operate
without such validation.

Some general concerns related to assessment of teacher performance have
been identified by Merwin. He cites 1) error due to a lack of comparability
in conditions under which the measure is taken; 2) errors in observing and
recording behavior; and 3) inaccuracies in the matching of the observed
behavior against the criterion behavior in attempting to arrive at the
yes-no decision regarding achievement of competency (Merwin, 1973, p. 10).

As noted in the introductory pages, the selection of evaluation techniques
depends partially upon the specificity of the competency. Examples of
teacher competencies in the performance area may be useful at this point
to illustrate some of the problems encountered due to level of specificity.
The following examples were derived from The Florida Catalog of Teacher
Competencies (Dodl, 1973).

1) Identify a student's instructional needs on basis
   of errors.

2) Involve students in teacher-pupil planning.

3) Structure opportunities to develop health and
   safety habits.

4) Help students develop attitudes compatible with society
   and self.

5) Cause student to perceive relevance of learning.

6) Use variety of media in course of teaching lesson
   or unit.

Merwin has analyzed these competencies and provided the following
concerns.

In the first example, "Identify a student's instructional
needs on basis of errors," one assessor might well
accept a simple oral questioning procedure while another
might consider only careful classification of errors
established on a theory of development as adequate.
As evidence of "involving students in teacher-pupil
planning" (example number 2) one judge might accept
allowing students to say what they want to do, while
another may feel that the observation is not complete
until completion of what is jointly planned. The complexities
and alternative procedures that might be involved in
determining whether teacher has "caused" a student
to perceive relevance of learning (example number 5)
are almost unlimited. Whether what is needed to make these
competency statements functional for directing measurement
efforts is greater explicitness in the behavior to
be observed, the need for adding criteria of acceptance,
or both, it must be recognized that they do not provide
an adequate base for designing assessments as they
stand (Merwin, 1973, pp. 12, 13).

Even without the criteria statements needed to judge the
adequacy of explicitness for unambiguously directing
development of the measurement procedures to be used,
a number of aspects of these statements pose assessment
problems. For example, there are bound to be difficulties
in designing procedures to determine the amount of "help"
provided by a teacher in attempting to demonstrate his
competency to help students develop attitudes compatible
with society and self (number 4). The variety of media
available and practical will vary widely from situation to
situation in assessing a teacher's competency to use a
variety of media (number 6) (Merwin, 1973, pp. 8, 9).

The tenuous nature of criterion levels was examined in the preceding
knowledge domain section, and these concerns apply to performance
level. Also cited earlier as a factor in the evaluation of teaching
performance is the context called for in the competency statement. As
one reads the above comments it is important to note that many of the
concerns have relevance only within the context of an unstructured or
uncontrolled (experimentally) environment such as a normal classroom.
A very different context is provided by simulation situations where
variables are controlled and the context is somewhat structured. This
situation is analogous to Turner's levels four and five.



Working under limited simulation procedures to assess
teacher behavior during interaction with pupils as called
for at level four allows more control of conditions, permitting
greater objectivity and focus of observation of teacher
performance at a cost of some realism. Level five simply
provides further control of factors affecting the
assessment of teacher performance at the cost of possibly a crucial
element, use of live students (Merwin, 1973, pp. 16, 17).

In discussing performance assessment, a number of references to
context have been made in the literature. Morse, Smith, and Thomas (1972)
state that the nature of the context, i.e., the people who make decisions,
the setting and the role being assumed, plays a crucial part in determining
the way in which an individual will be judged as to competence. How
the competency is defined, the focus of investigation, the criteria and
standards are all said to be a function of the context in which assessment
is to take place.

Okey and Humphreys (1974) point out that performance outcomes are the
doing skills of teaching, many of which require a classroom setting
while they are learned and assessed. Furthermore, Garrison (1974)
relates that, in his experience, in a program that defines competency
in terms of the performance of teaching functions in an ongoing school
setting the identification of the contexts in which competencies are to
be demonstrated becomes as critical as the identification of the competencies
themselves.

In addition to the context elements referred to above, Merwin (1973)
points to two other concerns related to context and performance assessment,
namely, 1) the content under study and methods of teaching, and
2) the background relative to the topic under study that the pupils bring
to the learning experience. This latter concern as to the background
of the pupils assumes that the task of the teacher will be different if
the children are relatively homogeneous with few deficiencies, as
opposed to a heterogeneous group of pupils, some having considerable
deficiencies. Also the personal characteristics and attitudes toward
schools and learning of the pupils should be considered when evaluating
the performance of the individual teacher.

Howell (1971) mentions a two-fold problem in terms of gathering data
in evaluating performance. All factors likely to have major effects on
the learning in question need to be described, as well as possible
extraneous influences on pupil performance from which the data are
obtained.

The known sources of possible contamination can often
be dealt with in designing the evaluation procedures,
and unknown ones can be countered by sampling teaching
performance generously and averaging results over a number
of occasions or over many learners. But this may be
expensive. The sample size, the sampling procedures,
control over pupil situational variables to assure comparable
conditions for the pre and post-learning performances, and
recognition of interventions other than teaching--all are
problems of the validity of the data, which are quite
distinct from problems of the validity of the theoretical
constructs or of the teaching purposes. . . (Howell, 1971, p. 21).

One competency-based teacher education program has described its
approach to assessment which accounts for context.



The approach taken to the measurement of individual
teaching competencies was one of obtaining carefully
delimited professional judgments, in the form of rating
scale placements, as to the adequacy of a student's
performance in a particular demonstration context. At
least two separate professional judgments were obtained
in relation to each competency demonstration, one from a
student's college supervisor and one from his school
supervisor. An evaluative judgment was also obtained
from a content specialist if a student requested it. The
ratings were designed so as to accommodate the impact of
setting differences on competency demonstration
(Garrison, 1974, pp. 65-66).

Merwin (1973) has cited several concerns related to assessing performance
including [illegible] and reproducible observations,
sampling problems involving elements of time, environmental factors
surrounding the performance under observation, and characteristics of
both the pupils and the type of learning involved.

Baird and Yorke have focused on problems of selecting context and the
timing of assessment such as

A. One setting, one time vs. many settings, many times.

B. Early (in the day, week, semester, etc.) vs. late.

C. Before (diagnostic), during (formative), or after
   (summative) instruction (Baird and York, 1971, p. ).

The predictive validity of performance assessment, particularly in
student teaching or similar type situations, is also a problem because
the prediction of individual differences for future performance could be
unreliable due to the limited range of performance observed. Yet it
has been noted that

. . . what the student teacher does under a specific set of
circumstances at a given point of time is of less concern than
what the performance tells us about future performance -- the
validity of the assessment of predicting future effectiveness
in helping pupils learn (Merwin, 1973, p. 22).

A note of caution, however, has also been provided. McDonald cautions
that:

We cannot treat teaching as if it were so different on each 
separate occasion that we can never evaluate it. The conflict 
between establishing reasonable expectations for teaching 
performance and the variety and complexity of the situations 
in which teaching occurs is one of the most important problems 
we have to solve. Until it is solved, our decisions about
competence must necessarily be tentative (McDonald, 1974, p. 22).



A similar (or even synonymous) concern relates to sampling, and the
relationship of an individual's performance at a given point in time to his actual
ability to demonstrate a skill should he so choose. This relationship
between "performance" and "competence" is, in a sense, a predictive
validity issue affected by adequacies in sampling. Several writers have
expressed concern over this issue:

A major matter of concern revolves around sampling which will
permit defensible generalizations (Merwin, 1973, p. 10).

The extent to which evidence gathering situations permit
students to manifest the behaviors inherent in the competencies
is the extent to which the evaluation is valid . . . Any testing
situation provides only a sample of a student's behavior
(Airasian, 1974, p. 17).

A difficult problem associated with monitoring the activities
assigned in a classroom is that of sampling. The drawing
of reliable samples, of course, is a difficult problem regardless
of the observational system that is being employed (Raths, 1973, pp. 20

There is the related problem of sampling. Does the absence of
an item from a person's speech mean that he cannot produce it
or merely that he has not found it necessary to produce it?
(Dill, 1974, p. 9).

Imbedded in this issue is the question of performance versus competence.
Dill (1974) argues that teaching competence is not to be confused with
teaching performance. Teaching performance is what the teacher actually
does, and is based on knowledge of the instructional content and pedagogy
as well as other factors such as memory, non-pedagogical knowledge and
beliefs, distractions, fatigue, etc. In studying actual teaching performance
one must consider a variety of factors, and the underlying competence of the
teacher is only one factor.

In the following comment teaching competence is viewed as only being
observable in a very controlled situation where context variables have
little influence. It should be noted again, however, that this assumes
evaluation of "competence" as opposed to a specific "competency."

Evaluation of a teacher's competence in a student teaching situation
requires accounting for a variety of factors, whereas evaluation of a
specific competency in a microteaching session is less complicated
and "performance" is more directly related to a competency.

Only under idealized conditions can teaching behavior
be taken to be a direct reflection of teaching competence.
In actual fact, teaching performance cannot ever directly
reflect teaching competence. Observation of actual teaching
behavior will show numerous false starts, deviations from
plans, etc. Teaching competence, then, is concerned with
an ideal teacher, in a completely adequate classroom, who
knows the pedagogy and content perfectly, and is unaffected
by classroom conditions of crowding, inattention, distractions,
etc. (Dill, 1974, pp. 29-30).

Howell (1971) has also distinguished between teaching competence,
teaching competencies, and teaching performance in the following manner:

1) Teaching competence, as such, is not directly
   observable but is generally regarded as a more or
   less enduring personal characteristic;

2) A specific teaching competency, too, is presumed to
   be persistent and hence applicable to a whole series of
   situations within the limitations of its definition;

3) A teaching performance, however, is the observable
   manifestation of teaching competence, or competency, and
   is bound by time and place and other general situational
   variables, which define its setting or context (Howell, 1971, pp. 4-6).

Perhaps the most widely used method of assessing teacher performance is
subjective rating, where an observer evaluates the teacher or trainee
on the basis of his own criteria and interpretation of the situation.
Problems with ratings again focus on the observer. Popham (1974)
suggests that the difficulty may be due to different notions that raters
(administrators, peers, students, etc.) have regarding what constitutes
good teaching. Quirk (1972) suggests that one method used to avoid this
problem, or at least modify it, is to train raters carefully on the definition
of the items, show the raters examples of teacher behavior for each
item, and check for the reliability of the ratings using actual classroom
situations.

Another method of assessing teaching performance skills that has
received considerable attention is the use of systematic observation
techniques. Two systems are used to record behaviors, sign systems
and category systems. The sign system uses a large number of behaviorally
defined variables which are checked if they occur during a short, e.g.,
five minute, observation period. Category systems deal with fewer
variables (categories) and are recorded continuously.
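
The difference between the two recording schemes can be sketched as two small data layouts. The example below is illustrative only: the behavior names, category codes, and function names are hypothetical and are not drawn from any published observation instrument. A sign record is taken to be the set of listed behaviors checked in each short observation period, and a category record is taken to be one code per recording interval.

```python
from collections import Counter


def score_sign_system(periods, sign_items):
    """Count the observation periods in which each sign item was checked.

    periods: list of sets, one set of checked behaviors per period.
    sign_items: the full list of behaviors on the (hypothetical) schedule.
    """
    return {item: sum(item in seen for seen in periods) for item in sign_items}


def score_category_system(code_stream):
    """Proportion of recording intervals assigned to each category code."""
    if not code_stream:
        raise ValueError("code_stream must contain at least one code")
    counts = Counter(code_stream)
    total = len(code_stream)
    return {code: count / total for code, count in counts.items()}


if __name__ == "__main__":
    # Three five-minute periods scored against a three-item sign schedule.
    periods = [{"asks_question", "praises"}, {"asks_question"}, set()]
    print(score_sign_system(periods, ["asks_question", "praises", "uses_media"]))

    # A short continuous category record, one code per interval.
    print(score_category_system(["lecture", "lecture", "question", "lecture"]))
```

In this layout the sign system answers "in how many periods did the behavior occur at all," while the category system answers "what share of the observed time fell in each category."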


Many of the problems cited for teacher rating methods are eliminated
or vitiated through the use of systematic observation instruments.
By using systematic observation, the observer is made a recorder, insofar
as possible, rather than an evaluator (Soar). Also, according
to Soar, this data tends to be "low inference" rather than "high inference",
and stays closer to the original behavior. It may be noted here that
although low inference measures stay with the behavior observed, higher
inference measures appear to correlate better with some indices of
student achievement (e.g., Rosenshine and Furst, 1971). This relates
to the earlier discussion on the specificity of competency statements.
Systematic observation techniques illustrate how general (broad)
competency statements can be clarified by describing these in terms of
several specific items. This appears to have several desirable effects
when considering assessment. Some of these have been described by
McNeil and Popham (1973). For example, instruments which require less
inference from the observer have a greater agreement among users.
Reliability is also enhanced when the dimensions are clearly defined and
observers have had training, there is agreement on what is to be coded,
and there are fewer things for the observer to do during observation.

It is of interest to note:

Recent studies using ratings of intermediate levels of
inference, such as "clarity" and "enthusiasm," have
produced more promising results than the earlier high
inference ratings. However, before these results can be
used maximally, the low inference behaviors which enter
the ratings need to be identified (Soar, 1973, p. 208).

Merwin (1973) emphasizes that observation schedules must focus attention
of the rater specifically on those aspects of performance relevant to the
competency under judgment. Also, procedures for comparing the recorded
behavior with the standards set for the competency must be clear and
unambiguous. The degree of explicitness of the competency will be a large
determiner of success in this process.

Although systematic observation techniques appear to have an advantage
over rating systems, there are several factors to consider when utilizing
such techniques. Reliability and validity are among these factors and
have been treated in several ways in the literature. We will first
consider validity.

It has been argued (McDonald, 1974; Abramson, 1971, among others)
that measurement procedures used in the evaluation of teaching
competency must have high validity. That is, there must be a
demonstrated relationship between a teaching skill or performance and
its effects upon students. However, in a number of studies that have
attempted to relate pupil outcomes to classroom interaction variables,
little relationship was found between pupil achievement and the observed
teacher classroom behavior. According to Abramson (1971) findings of
such studies may result from incompatibility of the achievement and
observation data collected. Pupil achievement is collected on
individuals, while observations are group data.

McDonald states, however, that the conclusion should not be drawn
that we must defer the development of an evaluation system until all
the relevant research has been done. He argues that there is already
an abundance of ideas on pertinent teaching competencies which we
can begin to measure, a necessary first step, and whose effect on
teaching performance can be studied systematically as part of the process
of developing evaluation systems (McDonald, 1974, p. 24).

Abramson (1971) considers the validity of observation systems in terms
of content, concurrent, and construct validity. He defines these in
the following manner:

1) Content validity is the degree to which the system
   provides information that is representative of the
   population of classroom behaviors that the system is
   meant to classify . . . It is essential that empirical
   evidence of the system's content validity be obtained . . .

2) The concurrent validity of two or more instruments is a
   function of the agreement between the measurements
   resulting from the application of these instruments.
   Typically, a new instrument is shown to be valid if the
   results obtained from its application are comparable to those
   obtained from a criterion measure, usually a more established
   instrument or a measurement with known validity. This
   validation process using two or more observation systems could
   also be followed providing the criterion against which the
   new instruments are to be validated is itself valid.

3) Construct validity is the degree to which the hypothesized
   outcomes of the practical application of the theory which
   gave rise to the instrument are borne out by the results
   of the appropriate experiments in which it has been used
   (Abramson, 1971, pp. 5-7).

According to Medley and Mitzel (1963), in order for an observational
scale to be valid for measuring behavior, it must provide an accurate
record of behaviors which actually occurred, scored in such a way that
the scores are reliable.

In addition:

The validity of measurements of behavior as the term
is used here, depends then, on the fulfillment of three
conditions: 1) A representative sample of the behaviors
to be measured must be observed. 2) An accurate record
of the observed behaviors must be obtained. 3) The records
must be scored so as to faithfully reflect differences in
behavior.

The first condition would be fulfilled perfectly if the observed
behaviors were a single random sample of the behaviors to
be measured. Unfortunately, it is seldom feasible to obtain
a random sample in practice, so it is necessary to use
nonrandom samples with care to make them at least appear
to be representative.

The second condition -- accurate record of behavior -- and the
third -- meaningful scoring -- are interdependent in the sense that
how a record may be scored depends on how it is made, but
they must be kept separate using a technique (Medley and
Mitzel, 1963, p. 250).

Reliability has received more attention than validity in the literature,
and also has been viewed from several perspectives. According to
Abramson (1971) the reliability of assessment procedures needs to be
established, reliability referring to replicability of the measurement
and its underlying construct.

According to Quirk (1972) reliability is the sine qua non of the use of a
measurement device. If the reliability of a performance or a judgment
is low, the prediction of subsequent performance based on that
measurement device is not likely to increase very much above chance
level.

Abramson (1971) reviewed some of the literature that dealt with the
reliability of observational measurements and concluded that there were
essentially two major procedures normally used to establish the
reliability of these data: 1) coefficients of observer agreement, and
2) an analysis of variance (ANOVA) technique first proposed and
developed by Medley and Mitzel (1963). Most studies, including
Flanders', have used the per cent agreement or Scott's (1955) coefficient
of agreement between observers as their measure of reliability, with
fewer studies reporting reliabilities based on the ANOVA technique.
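
The two observer agreement measures named above can be computed from a pair of category records. The following is a minimal sketch, assuming two observers coded the same intervals and that Scott's coefficient takes its usual form, pi = (Po - Pe) / (1 - Pe), where Po is the observed proportion of agreement and Pe is the agreement expected from the pooled category proportions. The function names and the category codes "T" and "S" are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Proportion of intervals on which two observers recorded the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both records must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def scotts_pi(codes_a, codes_b):
    """Scott's coefficient: agreement corrected for chance agreement."""
    observed = percent_agreement(codes_a, codes_b)
    pooled = Counter(codes_a) + Counter(codes_b)
    total = 2 * len(codes_a)
    expected = sum((count / total) ** 2 for count in pooled.values())
    if expected >= 1.0:
        raise ValueError("undefined when only one category was ever used")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Ten intervals coded as teacher talk ("T") or student talk ("S").
    observer_a = ["T", "T", "S", "S", "T", "S", "T", "S", "T", "T"]
    observer_b = ["T", "T", "S", "T", "T", "S", "T", "S", "S", "T"]
    print(percent_agreement(observer_a, observer_b))
    print(scotts_pi(observer_a, observer_b))
```

The corrected coefficient is always lower than the raw percentage when some agreement would be expected by chance, which is why the two figures should not be compared directly across studies.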

According to Abramson, the coefficient of observer agreement and its
variations may be thought of as roughly analogous to the test-retest
or alternate forms reliability of most standardized tests because it
provides a measure of comparability between two or more measurements of
samples drawn from a larger population of behaviors. However, the
major advantage of the ANOVA technique, according to Abramson,
results from its ability to partition the sources of variation inherent
in the data into its component parts and thus yield error estimates as
well as obtain estimates of true and total variance and calculate
reliabilities using the classical definition
$r = s^2_{true} / s^2_{total}$. It is
thus possible to calculate reliabilities for the entire observation
schedule and for the individual items which comprise it. These
reliability coefficients and the error estimates for the different sources
of variance may be extremely useful during the initial phases of item
construction and revision because these data permit comparisons
between the variances generated by items, observers, and teachers.
Thus, through the ANOVA technique, it is possible to obtain
inter-observer reliabilities as well as other useful information
(Abramson, 1971, pp. 4-5).
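
The idea of partitioning variance to obtain a reliability of the form true variance over total variance can be shown with a deliberately simplified design. The sketch below assumes a one-way layout in which each classroom is observed the same number of times and all within-classroom variation is treated as error. It is not the full Medley and Mitzel procedure, which separates further sources such as observers and items; the function name and the scores are hypothetical.

```python
def one_way_reliability(scores_by_classroom):
    """Estimate true, error, and total variance from a one-way ANOVA.

    scores_by_classroom: list of lists, one inner list of repeated
    observation scores per classroom, all of equal length k >= 2.
    """
    n = len(scores_by_classroom)
    k = len(scores_by_classroom[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores_by_classroom):
        raise ValueError("need >= 2 classrooms with the same k >= 2 scores each")

    grand_mean = sum(sum(row) for row in scores_by_classroom) / (n * k)
    class_means = [sum(row) / k for row in scores_by_classroom]

    ss_between = k * sum((m - grand_mean) ** 2 for m in class_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(scores_by_classroom, class_means) for x in row
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    # The expected between-classroom mean square is error + k * true variance.
    true_variance = max(0.0, (ms_between - ms_within) / k)
    total_variance = true_variance + ms_within
    reliability = true_variance / total_variance if total_variance > 0 else float("nan")
    return {
        "true_variance": true_variance,
        "error_variance": ms_within,
        "reliability": reliability,
    }


if __name__ == "__main__":
    # Three classrooms, each observed twice (hypothetical scores).
    print(one_way_reliability([[10, 12], [20, 22], [30, 28]]))
```

When classrooms barely differ from one another, the estimated true variance shrinks toward zero and so does the reliability, even if repeated observations of the same classroom agree closely.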

Medley and Mitzel (1963) define reliability as the extent to which
the average difference between two measurements independently obtained
in the same classroom is smaller than the average difference between
two measurements obtained in different classrooms. According to
Medley and Mitzel unreliability can result from two measures of the
same class differing too much due to the behaviors being unstable, lack
of agreement among observers, different items lacking consistency, etc.
It may also result from the differences between different classes being
too small (Medley and Mitzel, 1963, p. 250).

Medley and Mitzel defined three terms useful in reliability determinations. 

1) Reliability coefficient refers to the correlation
to be expected between scores based on
observations made by different observers at
different times.

2) Coefficient of observer agreement is the
correlation between scores based on observations
made by different observers at the same time.

3) Stability coefficient is a correlation between scores
based on observations made by the same observer at
different times.

Using these definitions, the following argument is presented: 

The true score pertains to the typical behavior that
would be observed in a classroom over a period of time,
only a sample of which is actually observed. Then a
coefficient of observer agreement does not tell us how
closely an obtained score may be expected to approximate
a true score, because the two measures correlated are
based on a single sample of behavior. The true score
pertains also to the actual behavior which occurs, rather
than to what some particular observer would see. There-
fore, a stability coefficient does not estimate the
accuracy of a score either, since it is based on a
correlation between observations made by a single
observer. The coefficient of observer agreement tells
us something about the objectivity of an observational
technique; the coefficient of stability tells us something
about the consistency of the behavior from time to time.
But only the reliability coefficient tells us how accurate
our measurements are (Medley and Mitzel, 1963, p. 254).

Further study of reliability is provided by Brown et al.:

Per cent of agreement between observers tells almost nothing
about the accuracy of the scores obtained. It is entirely
possible to find observers agreeing 99 per cent in recording
behaviors on an instrument whose item or category consistency
is very poor. Reliability can be low even though observer
agreement is high for several reasons. For example,
observers might be able to agree perfectly that a particular
teaching practice occurred in a classroom, yet if that same
practice occurs equally, or nearly so, in all classrooms, the
reliability of that item as a measure of differences between
teachers will be zero. Errors arising from variations in
behavior from one situation or occasion to another can far
outweigh errors arising from failure of two observers to
agree exactly in their records of the same behavior
(Brown, Mendenhall, and Beaver, 1968, p. 4).
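
The case Brown et al. describe, perfect observer agreement alongside zero reliability, can be shown with a few made-up numbers. The tallies and function names below are hypothetical; the sketch only assumes that "reliability as a measure of differences between teachers" depends on the variance of classroom averages.

```python
from statistics import mean, pvariance

# Hypothetical counts of one teaching practice in four classrooms.
# Each pair holds what two observers recorded during the same visit.
visit_1 = [(5, 5), (6, 6), (4, 4), (5, 5)]
visit_2 = [(5, 5), (4, 4), (6, 6), (5, 5)]


def observer_agreement(*visits):
    """Share of simultaneous records on which the two observers agree exactly."""
    pairs = [pair for visit in visits for pair in visit]
    return sum(a == b for a, b in pairs) / len(pairs)


def between_classroom_variance(*visits):
    """Variance of classroom averages taken over observers and visits."""
    n_classrooms = len(visits[0])
    classroom_means = [
        mean(score for visit in visits for score in visit[i])
        for i in range(n_classrooms)
    ]
    return pvariance(classroom_means)


agreement_rate = observer_agreement(visit_1, visit_2)
true_variance = between_classroom_variance(visit_1, visit_2)
```

Here the observers never disagree, but every classroom averages the same count, so there is no between-teacher variance for the item to measure; all of the variation comes from occasion to occasion.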

Although reliability and validity have received the most attention, there
are a number of other concerns related to systematic observation
techniques. McDonald (1974) suggests that we must develop information
related to the reliability, validity, and the learnability of teaching
skills. Also, according to McDonald, the information gathered must
be uncontaminated by subjective biases and political processes, and
the conditions of measurement must provide comparable information on
groups of teachers. That is, the conditions under which teacher behavior
is measured must be standardized.

Flanders, whose work has been most influential in the development
and utilization of systematic observation, has pointed out that choosing
a particular system of interaction analysis tends to determine how one
will conceptualize teaching (Flanders, 1974, p. 313).

Popham (1974) argues that assessment energy should be focused on
the desired outcomes in learners, that is, assess the end results
directly without encountering the measurement noise associated with
the extra assessment step involved in systematic observation. However,
the difficulties which are encountered using product criteria will be
discussed in a later section.

Other problems identified by Popham (1974) are that

-deleterious factors may cancel out positive teacher
behavior, and a manageable system could not pick up
all negative process variables,

-observational approaches identify general classroom
practices whereas teacher evaluation requires personal
and particular decisions, and

-there is considerable danger that many teachers will
"fake good."

Flanders has identified needed improvements in this approach to assess-
ment. These are: the need for mathematical models to help guide the
conceptualization of interactive phenomena and assist in establishing
procedures for analyzing the data, attention to more effective methods
of observer training and procedures for estimating the reliability of
observation, and the development of multiple coding within a single
time frame and analysis of longer chains (Flanders, 1974).

The preceding techniques have primarily been used to assess teaching
performance competencies in actual classroom settings. Due to the
variety of problems posed by context variables previously described,
some assessment procedures have been devised for simulated situations.
The reader may recall that the nature of the competency statement also
determines whether or not live classrooms are required or if simulation
is appropriate. A rationale for such an approach and some characteristics
are provided in the following:

Interaction skills are particularly difficult to measure.
Attempts to do so with paper-and-pencil instruments
have failed completely, mainly because no one has been able
to devise test exercises which call for the kinds of abilities
that determine success in face-to-face interactions -- the
ability to "read" behavior, relate it to professional knowledge,
and react almost instantaneously, for instance.

Attempts to measure interaction skills directly -- that is,
by observing the teacher in action with a class -- have
been more successful, in the sense that it has been
possible to identify some of these skills and to observe
performances at various levels of skill. Such attempts
must fail as measuring instruments, however, because it
will never be possible to secure comparable samples of
the behaviors of different teachers from which measurements
can be derived. It has been impossible to confront any
two teachers with the same problem (or equivalent ones)
because no two pupils are alike -- much less two classes --
and no single pupil or class is the same after an experience
as before it.

What is needed is a procedure for simulating the problems
a teacher encounters when he interacts with a class, a
procedure which can be duplicated over and over so that
more than one teacher can be confronted with the identical
problem.

One approach that has been suggested and tried with limited
success is to use a film or videotape recording of a class
to simulate the real one. The strength of this method lies
in the realistic stimuli it can present. When one sits or
stands before the giant screen at Teaching Research in
Oregon, where the Classroom Simulator was developed, and
sees and hears the life-size, full color representation of a
classroom before him, the approximation to confronting a live
class is startlingly close. And when one intervenes -- asks
a pupil to stop doing something, perhaps -- and the pupil
responds appropriately, the effect is even more realistic.

Unfortunately, this does not always happen. Sometimes,
the pupil's response is not so appropriate. Limitations of the
equipment make it possible to offer only three alternative
pupil responses per problem; and these three are not always
perfectly synchronized. Nor can they include fully appropriate
follow-up to all the wide variety of responses teachers might
make. And, finally, each problem must be short in duration
since only one intervention point can be provided.

Two basic problems confront us when we try to simulate
classroom interaction. One has to do with the difficulty in
constructing a model which can generate appropriate reactions
no matter what the teacher response may be, and when it
comes, providing pupil reactions which are lawful and
predictable to all these possibilities. The other has to do
with the difficulty of providing continuity because the number
of alternate stimuli needed increases at a geometric rate each
time the teacher responds, and each alternative has to be
worked out in advance, filmed, and programmed (Medley, 1969, pp
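
The geometric growth mentioned in the passage above is easy to underestimate, so a small worked count may help. This is only a sketch under a stated assumption: every teacher response at each intervention point leads to its own pre-filmed continuation, with the same number of alternatives at every point. The function name is hypothetical; the figure of three alternatives comes from the equipment limit described in the quotation.

```python
def segments_needed(alternatives: int, interventions: int) -> int:
    """Total filmed segments for a fully branching simulated lesson.

    After the d-th intervention there are alternatives ** d distinct
    continuations, and each one must be filmed in advance.
    """
    if alternatives < 1 or interventions < 0:
        raise ValueError("alternatives must be >= 1 and interventions >= 0")
    return sum(alternatives ** depth for depth in range(1, interventions + 1))


# Three alternatives per intervention point, as in the simulator described.
one_point = segments_needed(3, 1)
five_points = segments_needed(3, 5)
```

Under this assumption a single intervention point needs three segments, while five consecutive points would need several hundred, which is consistent with the remark that each problem had to be kept short.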

McDonald has also offered some alternative simulation assessment
strategies and is optimistic about their use. One type is a filmed
simulation test that portrays a teacher conducting a class. The film
is stopped periodically and the viewer is asked to say what he or she
would do in this situation; in other parts of the test the viewer is
asked to explain what is occurring in the class, and, in some places,
he is asked what advice or suggestions he would give to the teacher.

Another involves the teacher arranging the subject matter in the form
of presentations or questions, and the experimenter responds whenever
the teacher asks a question. This gamelike situation does discriminate
sharply between deductive and inductive teaching styles (McDonald, 1974, p.

The use of performance tests, such as those used in simulated
situations, has also raised several measurement concerns. According
to Quirk (1971), compared to the more popular paper-and-pencil multiple-
choice tests, performance tests are much more complicated to administer,
usually test only one individual at a time, require special training for
the observers, are more difficult to score reliably, and are more expensive
to administer and to score in terms of personnel time, equipment, and
facilities. Test security is also a serious problem (Quirk, 1971, pp. 10-11).

Quirk (1974, p. 317) also cites what he calls a host of critically
important research questions about microteaching tests or other
simulated tests. For example, how consistent is the teacher's behavior
over time? What is the effect of familiar versus unfamiliar pupils on
the behavior of the teacher? What is the effect of pupil practice on the
teacher? How is teacher behavior related to pupil learning? What are
the correlations between simulated teaching tests and paper-and-pencil
tests? So far, he asserts, these questions far outnumber the adequate
answers.

Affective Assessment 

The affective area is difficult to assess, and it is usually not
subject to formal evaluation in teacher education programs. A variety
of procedures for developing affective domain competencies, however, have
been developed, and objectives of these activities have been established.
Competencies stated in this area must be evaluated, but due to the nature
of the area, competencies may be stated in broad terms and unique kinds of
assessment strategies, such as unobtrusive measures and long term data,
may be required.

Some general and somewhat "social" concerns voiced by Airasian (1974)
include the question of to whom the judgments about a given student's values,
personality, interests, and preferences are disseminated, in what form,
with what guidelines, and for how long. DeMarte et al. (1975) report that

Since there are no right answers to emotions, attitudes,
or feelings, and because human beings tend to "second guess"
experimenters, the accuracy of any affective assessment can be
questioned, particularly paper and pencil instruments. Given
this state of the art, affective instruments must be used with
caution in teacher education (DeMarte et al., 1975, p. 2).

As indicated in the early pages of this paper, some would consider this
competency area to have two parts, teacher personality characteristics
and teaching behaviors in the affective domain. It is possible, it may be
argued, that a teacher can and does demonstrate sensitivity to students'
needs, utilize students' ideas, and accept their feelings, and yet does
not possess the personality characteristics of warmth, sensitivity, or
empathy. He may demonstrate the affective teaching skills because he
has been trained to do so and believes it is a good teaching technique.
Whether or not this is an acceptable dichotomy is, of course, a moot
point, but these two components will nevertheless be examined here.

In terms of personality characteristics, it has been noted by Getzels
and Jackson (1963) that very little is known for certain about the nature
and measurement of teaching personality, or about the relation between
teacher personality and teaching effectiveness.

Some approaches to assessment of personality are described by Sandefur
(1970), such as the Minnesota Teacher Attitude Inventory (MTAI), the
California F Scale, and the MMPI. The fakeability of such tests has
been noted as a potential source of error, particularly when one can readily
discern a preferred direction to fake, as on the MTAI.

In measuring noncognitive variables such as attitudes, several researchers
have turned to attitude questionnaires, similar to those above. An example
of a scale of this type was developed by Bogardus (1925) to measure
social distance, or the closeness of the relationship to which the respondent
is willing to admit members of designated social groups. Bogardus regarded
degree of acceptance in terms of whether or not individuals would accept
others: (1) to close kinship by marriage, (2) to my club as personal chums,
(3) to my street as neighbors, (4) to employment in my occupation in my
country, (5) to citizenship in my country, (6) as visitors only to my country,
and (7) would exclude from my country. A general tolerance score is obtained
by averaging the step values (ranging from one to seven) assigned by the
respondent to each of the groups he rated. Stern (1963) analyzed this type
of attitude assessment and noted four issues when items are assembled and
keyed arbitrarily in accordance with the opinions of the investigator:

1) Are all items relevant to the same measurement continuum?

2) Are the items in fact ordered as steps along that continuum?

3) Is the relative distance between the steps constant?

4) Are the responses actually a function of the attitude the
items were intended to sample, rather than of some
irrelevant process (Stern, 1963, p. 405)?
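
The scoring rule described for the social distance scale, averaging the step values assigned to each rated group, can be sketched as follows. The function name and the example ratings are hypothetical. Note that taking a simple average treats the seven steps as equally spaced, which is exactly the assumption Stern's third question calls into doubt.

```python
from statistics import mean

# Step values as listed in the text, from closest (1) to most distant (7).
BOGARDUS_STEPS = {
    1: "to close kinship by marriage",
    2: "to my club as personal chums",
    3: "to my street as neighbors",
    4: "to employment in my occupation in my country",
    5: "to citizenship in my country",
    6: "as visitors only to my country",
    7: "would exclude from my country",
}


def tolerance_score(ratings):
    """Average step value over the groups one respondent rated.

    ratings: mapping of group label -> step value (1 through 7).
    """
    if not ratings:
        raise ValueError("at least one rated group is required")
    for group, step in ratings.items():
        if step not in BOGARDUS_STEPS:
            raise ValueError(f"step for {group!r} must be an integer from 1 to 7")
    return float(mean(ratings.values()))


# Hypothetical respondent rating three unnamed groups.
score = tolerance_score({"Group A": 1, "Group B": 3, "Group C": 5})
```

A lower score indicates that the respondent admits the rated groups to closer relationships on average.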

Assessment of affective teaching competencies is also not very encouraging:

Unquestionably the state of the art of affective assessment
lags behind cognitive or psychomotor assessment. In the
end, interpretive judgments based upon both formal and
informal observations and discussions will probably provide
the optimum means of gathering affective evaluative data
about student progress. The lack of objectivity associated
with such techniques in comparison to more formal paper-and-
pencil techniques should not deter evaluation. One method
of stressing the importance of affective aims is to diagnose
and evaluate them (Airasian, 1974, pp. 17-18).

Among the devices used for assessment in this area are: systematic
observation techniques (previously described), self-response questionnaires,
Q-sort techniques, the semantic differential, and rating scales.

In terms of rating scales it has been noted that

. . . the measuring device is not the paper form but rather
the individual rater. Hence a rating scale differs in important
respects from other paper-and-pencil devices. In addition
to any limitations imposed by the form itself, ratings are
limited by the characteristics of the human rater--his
inevitably selective perception, memory, and forgetting, his
lack of sensitivity to what may be psychologically and socially
important, his inaccuracies of observation and, in the case
of self-ratings, the well established tendency to put his
best foot forward, to perceive himself in a more favorable
perspective than others do (Remmers, 1963, p. 329).

Rating scales can be evaluated on the basis of the following criteria:

1) Objectivity. Use of the instrument should yield verifiable,
reproducible data not a function of the peculiar
characteristics of the rater.

2) Reliability. It should yield the same values, within
the limits of allowable error, under the same set of
conditions. Since basically, in ratings, the rater and
not the record of his response is the instrument, this
criterion boils down to the accuracy of observations by
the rater.

3) Sensitivity. It should yield as fine distinctions as
are typically made in communicating about the object of
investigation.

4) Validity. Its content, in this case the categories in the
rating scale, should be relevant to a defined area of
investigation and to some relevant behavioral science
construct; if possible, the data should be covariant with
some other, experimentally independent index. These
requirements correspond to the concepts of definitional,
construct, concurrent and predictive validity (American
Psychological Association, et al., 1954).

5) Utility. It should efficiently yield information relevant
to contemporary theoretical and practical issues; i.e.,
it should not be so cumbersome and laborious as to
preclude collection of data at a reasonable rate
(Remmers, 1963, p. 330).

Guilford has categorized rating scales into five major groups: numerical,
graphic, standard, accumulated points, and forced-choice. He also pointed
out that any such classification is a very loose one, based on shifting
principles (Guilford, 1954, pp. 263-301).

As with the other measurement devices considered in previous sections of
this paper, reliability and validity must be considered.

Remmers (1963) states that using reliability statistics for sociometric
data may be relatively meaningless and even misleading. For example,
in test-retest coefficients there is a problem of distinguishing between
effects of memory and those of real change. If there is too short an
interval between testing, memory may play an important part in increasing
consistency of responses, whereas if the interval is too long, there
may be real changes in group structure, thus lowering reliability coefficients.

In terms of validity, there are also fundamental differences between
psychometric tests and sociometric tests. That is, in a psychometrically
derived test we try to measure some trait by eliciting some related
responses. In a sociometric test the behavior is actually sampled. In
effect, the predictor is the same as the criterion, as long as we are not
interested in drawing inferences from the behavior observed (Remmers, 1963).

Also, there are human bias factors in rating scales. These include such
things as 1) opportunity bias due to time sampling problems, 2) experience
bias, that is, the behavior patterns may differ between those of an
experienced teacher and a practice teacher, 3) criterion distortion, which
is error built into a rating scale by including several correlated behaviors,
thus weighting the behavior disproportionately, and 4) rating biases
due to various response sets (Brogden and Taylor, 1950; Remmers, 1963).

The areas of attitude, personality, and affective domain competencies
are much more comprehensive than this analysis can provide, but the
concerns raised in this section are indicative of the problems involved
in assessment of this competency domain. Two references which provide
an inventory and analysis of existing instruments are as follows:

DeMarte, Patrick; Johnson, Donald; Molenkamp, Alice,
"Report on the Affective Dimension in Teacher Education,"
Rochester Area Colleges, Rochester, New York, 1975.

Beatty, Walcott, Improving Educational Assessment and An
Inventory of Measures of Affective Behavior. Association for
Supervision and Curriculum Development, NEA, Washington,
D.C., October 1969.

Product Assessment 

Consequence objectives may be the most interesting and controversial
of the competencies. These require the teacher trainee to produce
changes in students, usually achievement gains. The focus of assessment
in this situation is primarily on the students who are being instructed
by the teacher. Two different areas of focus include student achievement
and the activity a student engages in. An example of the latter is "students
being attentive to class activities." Teacher competencies and assessment
approaches to this area have been described by Hatfield (1974). He
notes that competencies relating to students being attentive in class
may include use of designated conference techniques, techniques for
controlling disruptive behavior of students, and managing overall activities
in the classroom. In evaluating these types of teacher competencies at
the performance level, two approaches could be used: 1) to see if the
teacher actually used the techniques, and 2) to see if the teacher, in
fact, achieves the purposes of the technique. The teacher is evaluated
not just for using the technique but on whether the student is actually
confronted in a meaningful way and responds to that confrontation
(Hatfield, 1974).

Problems related to evaluation of teacher performance described in
1) above are discussed in the section on performance assessment. In 2)
the teacher is evaluated on the basis of whether the student "responds
to that confrontation," or "if the student actually becomes attentive to the
activities." If the determination of this is left to the judgment of the
observer, the problems of observation techniques as previously discussed
must be considered. These include such factors as "halo effect" and
other response sets, reliability of observers, and sampling concerns,
among others. If an attempt is made to make the determination more
objective, then there is a problem of establishing a criterion level; e.g.,
how many students must be attentive. It would appear that the more
subjective approach utilizing professional judgment is the more viable
approach at this time, but is nevertheless not appealing from a measure-
ment point of view.

Medley, Soar, and Soar (1975) believe that assessing teacher competence
on the basis of pupil behavior is not appropriate. Among their concerns is
one of morality, that one human being's advancement is dependent on the
behavior of another (the pupil), which is not and should not be entirely
under his/her control.

Evaluating teacher performance utilizing student achievement (as measured
by test scores) as the criterion of effectiveness has also received attention
in competency-based education programs. Medley, Soar, and Soar (1975),
however, contend that evaluation of teaching through evaluation of pupil
outcomes is not a viable strategy. Several problems have been cited, and
again reference can be made to Turner's criteria.

Using changes in pupil behavior over a long period
(Turner's Level 1) or shorter period (Turner's Level 2)
as the measure of performance of a teacher candidate
to make the "go-no go" decision on development of a
competency poses several complexities in addition to
those set forth above. They include the need to state
the competency in terms of pupil behavior, assessment
in terms of a change in behavior based on a minimum of two
observations (before and after intervention by the teacher),
observing and recording performance relevant to the teacher
competency under consideration, and most problematic of all,
accurately identifying the teacher's contribution to the
change observed (Merwin, 1973, p. 13).

Airasian (1974) states that the data which must be gathered to evaluate a
teacher's effects upon student learning are not at all clear. Research
indicates that a large portion of the variance in student ability and
achievement is attributable to early environmental factors. Also,

The attribution of causation aspect offers an even greater
challenge. If the competencies are written in terms of ability
to bring about change in pupils, the process must involve
separation of those changes attributable to the teacher's
efforts from those that cannot be so attributed. Children's
learnings are affected by interactions with other children, the
extent to which their parents are interested and become
involved in what they learn, what they see on TV, how the
school is organized, the scheduling of their time by others,
and a host of other factors. Since these factors will impinge
on different pupils in different ways, one can hardly say
that one teacher has demonstrated "competency" and another
has not simply on the basis of changes in the performance of
their two groups of pupils (Merwin, 1973, p. 14).

Okey and Humphreys add that little is known about how to adjust expectations
of teacher success when they work with pupils that have different entering
abilities, backgrounds, aptitudes, motivation, and learning rates.
Differences in subject matter difficulty, instructional materials, and
classroom settings may also have important effects on pupil achievement,
and therefore, teacher consequence measures (Okey and Humphreys, 1974, p. 8).

According to Flanders (1974) one difficulty with measures of learning is
the overemphasis on subject matter achievement. Flanders suggests that
using a test of subject matter as the only criterion of learning is inadequate,
because student learning includes much more. For example, staying in
school and not dropping out, learning to like school and the process of
learning, gradually learning how to be more self-directing and independent,
learning how to make moral and ethical judgments, etc., may be more
important measures of teaching than are scores on content tests. Also,
given a focus of subject matter and a research design consisting of
pretest, teaching-learning, and posttest, it was found that posttest
achievement is much more strongly associated with pretest scores (at least
ten times more) than it is with any measure of teaching. This is due to
the pretest to posttest gain being mainly a function of ability, and
therefore in any assessment of teaching, student ability would have to be
controlled more thoroughly. Also, standardized achievement tests are
designed to be insensitive to the influence of a particular teacher and
reflect, instead, the total developmental background of the student.
In summary, Flanders states that conclusions are not really about teaching
effectiveness; instead, they are about student effectiveness
(Flanders, 1974, p. 312).

At the other end of the spectrum are the measures of pupil outcomes,
particularly the criteria used to assess these. Abramson (1971), in
discussing product criteria as a measure of teaching performance, points
out the problem of the ultimacy of the criteria. For example, does
effective teaching reflect gain in immediate factual knowledge, or improved
skills of an intermediate nature, or ability to apply these facts and skills,
or the more comprehensive "success" in life types of skills
(Abramson, 1971, p. 2)?

Also, Soar (1973) argues that attempts to measure teacher competence
through pupil gain in higher level objectives appear to be exceedingly
difficult and probably impossible in many cases. McNeil and Popham (1973)
cite technical problems in assessing learner growth such as concerns
about the adequacy of measures for assessing a wide range of pupil
attitudes and achievement at different educational levels and in diverse
subject-matter areas, failure to account for instructional variables that
the teacher does not control, and the unreliability in the results of
teacher behavior, that is, inconsistent progress of pupils under the same
teacher.

Further problems which are related to the analysis and interpretation of
learning scores, according to Stake (1973), include: grade-equivalent
scores, the "learning calendar," the unreliability of gain scores, and
regression effects. Instructional specialists (Hively, Patterson,
and Page, 1968), according to Stake, have questioned the appropriateness
of grade equivalents or any other "norm referencing" for interpreting
items. They object to defining performance primarily by indicating who
else performs as well. That is, the items on all standardized tests have
been selected on the basis of their ability to discriminate between the
more and less sophisticated students rather than to distinguish whether
or not a person has mastered his task, indicating successful attainment
of the instructional objectives. Grade equivalents are too gross to
measure individual short-term learning (Lennon, 1971; Stake, 1973).

In terms of the learning year, there is some basis for miscalculations.
For example, winter is a time for most rapid academic advancement, summer
the least. Also, there is a common belief that schooling should not aim
at terminal performance, but rather at continuing performance in the weeks
and months and years that follow.

Concern with the unreliability of gain scores can be viewed in the manner
described by Quirk (1972), or by Stake (1973). Consider, for example,
using a typical standardized achievement test with two parallel forms,
A and B, each having a reliability of +.84. Their correlation (that is,
the correlation of parallel forms Test A with Test B) in his example was
+.81. And, in using a standard formula (Thorndike and Hagen, 1969)
the reliability of gain scores (A-B, or B-A) would be +.16. Using the
raw score and grade equivalent standard deviations from the test's
technical manual, assuming 9.5 items and 2.7 years respectively, on
the average, a student's raw score would be in error by 2.5 items;
his grade equivalent score would be in error by .72 years, and his
grade equivalent gain score would be in error by 1.01 years
(Stake, 1973, p. 215).
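
The +.16 figure can be checked with the usual formula for the reliability of a difference between two equally variable scores: the average of the two reliabilities minus their intercorrelation, divided by one minus the intercorrelation. The sketch below assumes that this is the "standard formula" referred to; the function names are hypothetical, and the standard error of measurement is the conventional SD times the square root of one minus the reliability.

```python
import math


def gain_score_reliability(rel_a, rel_b, corr_ab):
    """Reliability of a difference score, assuming equal score variances."""
    if corr_ab >= 1:
        raise ValueError("the correlation between forms must be below 1")
    return ((rel_a + rel_b) / 2 - corr_ab) / (1 - corr_ab)


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Values quoted in the example above.
gain_reliability = gain_score_reliability(0.84, 0.84, 0.81)

# Standard deviations quoted from the technical manual.
sem_raw = standard_error_of_measurement(9.5, 0.84)    # in items
sem_grade = standard_error_of_measurement(2.7, 0.84)  # in years
```

The gain reliability comes out at about .16, as stated. The standard errors computed here are somewhat larger than the average errors quoted in the passage; those quoted figures are roughly what results from multiplying each standard error by 0.6745 (the probable error), although that reading is an inference and the convention actually used is not stated.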

Regression effects, i.e., initially low scores tend to move up toward
the mean while initially high scores tend to drop rather than gain,
have also caused some misinterpretation of the effects of instruction.
Lord (1963) discussed this universal phenomenon, and various ways to
set up a proper correction for it.
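
A small numerical sketch shows why regression effects can be mistaken for instructional effects. It assumes the simplest case, in which pretest and posttest have the same mean and standard deviation and no instruction occurs, so the expected posttest score is pulled toward the mean in proportion to the test-retest correlation. The function name and the numbers are hypothetical, and this is not one of the corrections Lord discusses.

```python
def expected_retest_score(score, group_mean, correlation):
    """Expected retest score when nothing but measurement error intervenes.

    Assumes equal means and standard deviations on both testings.
    """
    if not -1 <= correlation <= 1:
        raise ValueError("correlation must lie between -1 and 1")
    return group_mean + correlation * (score - group_mean)


# Hypothetical test with mean 50 and test-retest correlation 0.8.
low_scorer = expected_retest_score(30, 50, 0.8)
high_scorer = expected_retest_score(70, 50, 0.8)
```

Under these assumptions a pupil who scored 30 is expected to score about 34 on retest and a pupil who scored 70 about 66, with no teaching at all. A teacher assigned initially low-scoring pupils would therefore appear to produce gains, and one assigned high-scoring pupils to produce losses, unless the effect is corrected for.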

In spite of these concerns, the evaluation of teacher performance
utilizing student achievement as the criterion of effectiveness has
received considerable attention in competency-based teacher education
programs. One such attempt is the utilization of microteaching and
the development of teaching tests. McDonald argues that by stipulating
the objective, providing the teaching materials, and controlling the
variability of the pupils, the degree of the teacher's skill may be
assessed. Also, a teacher's skill can be assessed under a variety of
different teaching conditions using microteaching sessions. However,
this approach still has several limitations such as relatively short
lessons and a small number of students used. Therefore, McDonald and
others developed a mini-course format to use for more complex teaching
situations. The results of his analyses of these teaching performances
indicate that the microteaching performances are relatively poor predictors
of the teaching performances in the mini-courses. He concludes, however,
that the microteaching is more useful for assessing the degree to which a
teacher has basic skills, whereas the mini-course is most useful in
assessing how teachers integrate these skills into complex teaching
performances.

In discussing student teaching and internship experiences, McDonald
suggests that these can be used to assess daily performance under
uncontrolled conditions. They are useful for providing information on what
teachers are likely to do in contrast to what they are able to do. Also,
on-the-job observation can be used to assess such factors as teaching
style (McDonald, 1974, p. 24).

In student teaching, some writers (e.g., Okey and Humphreys, 1974)
suggest applying consequence objectives via criterion referencing.

A number of concerns have been directly related to the teaching test
approach. For example, teaching performance tests may have insufficient
reliability to permit their effective use in teacher evaluation (Glass, 1972).
Medley, Soar, and Soar point out that teaching tests can only measure
how effective a teacher is in achieving short-term goals, which are the
least important goals of education. Also, they point out, stability coefficients
(which describe correlations between mean gain scores of two classes taught
by the same teacher) are around .3, certainly not acceptable. Millman (1973)
suggests that with more reliable measures utilizing more items,
collected on larger student groups, after longer instructional sessions,
such teaching performance tests will be a more reliable indicator of
teaching effectiveness.
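
As a sketch of what a stability coefficient measures, the example below
computes the Pearson correlation between the mean gain scores that the same
teachers obtained with two different classes. The six pairs of values are
hypothetical and serve only to show the calculation; they are not data from
Medley, Soar, and Soar or from any other study cited here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical mean gain scores: one entry per teacher, in the same order.
class_one_gains = [4.0, 6.5, 3.0, 7.0, 5.5, 2.5]
class_two_gains = [5.0, 4.0, 6.0, 5.5, 3.5, 4.5]

stability = pearson_r(class_one_gains, class_two_gains)
print(f"stability coefficient: {stability:.2f}")
```

A coefficient near 1.0 would mean that teachers who produce large gains
with one class also do so with another. A value near .3, as reported in
the text, means that a teacher's standing with one class says little
about the same teacher's standing with the next.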



In concluding this section, Airasian's comment appears appropriate.

In sum, while it is always possible to evaluate teaching
competency by measuring student learning, the issues remaining
to be settled before such evaluation can be undertaken in an
intelligent manner, fair to both teachers and students, suggests
that student learning measures not be used to evaluate
individual teachers at present (Airasian, 1974, p. 19).



Experiences Assessment

Expressive objectives have no pre-determined outcomes; they require
only the experiencing of certain activities. In this case it may be
necessary only to evaluate whether or not one has indeed participated
in the experience. A checklist of necessary activities is one means of
assessing whether or not the individual has participated appropriately.
In those cases where observation of the activity does not occur, other
kinds of evidence may be required, such as diaries, descriptions, or
testimonials that the individual was present. Since this domain
requires little data, it is the easiest to "assess" but also yields
information of a less rigorous nature.



Summary 

Competency-based teacher education has been defined in various ways,
but there is general agreement on at least two basic elements. The first
essential characteristic is the specification of teacher competencies
which form the basis of the entire program. The second is the design
of assessment techniques directly related to the specified competencies.

Competencies have been written in a variety of ways and have been
related to various domains or competency areas. The competency
domains identified in the literature are knowledge, behaviors
(performance), attitudes, consequences, and experiences. There also
seems to be a variety of viewpoints as to how competencies should be
written. One approach is to write them as general statements of behavior
with some broadly defined expected level of achievement. Another
approach is to develop specific performance objectives derived from
the competency statement. Competency statements may also be written
as behavioral objectives. In each of the competency domains cited the
form of the competency statement must be examined to determine
appropriate assessment techniques.

There are a number of assessment factors in general which need to be
considered in the evaluation of competencies. The nature of the standards,
that is, criterion selection, is an essential aspect. Other concerns
are comprehensiveness and fidelity of the assessment system, validity
and reliability of data, and general utility of the process. In addition,
Turner has provided six criterion levels for competency evaluation which
provide a framework for identification of assessment areas.

Assessment of knowledge competencies generally can be accomplished
through paper-and-pencil testing. Of all the assessment areas the
knowledge domain is the most developed. Inherent in this process is
criterion-referenced testing. A problem one then encounters is the
application of traditional psychometric characteristics of tests. The
setting of criterion levels also has many difficulties associated with it.

Teacher outputs were identified as possibly being a unique group of
teacher competencies as opposed to being classified under the knowledge
or performance categories. Outputs represent primarily observable
dimensions of teacher productivity and serve as a bridge for connecting
teacher behavior with learner outcomes.

Assessment of teaching behaviors or performances requires observation
of the individual demonstrating the skill. This may be accomplished by
rating scales or structured observation systems (systematic observation scales).
It has been argued that teaching performance rather than pupil learning should
be the focus of assessment because measuring teacher effectiveness by
measuring change in pupils is probably only possible for simpler, lower
level objectives. Assessing teacher performance deals only with the
lower levels of Turner's criteria. Problems encountered in this competency
area relate to establishment of criterion levels, comparability of conditions,
and observation errors.

Other elements to be considered in performance assessment are the nature
of the content being taught, the background of the pupils being taught,
and general effects on learning which may not be accounted for. An
extremely important consideration is the context of performance assessment.
How the competency is defined, the focus of investigation, and the
criteria are all said to be a function of the context in which assessment
is to take place. It has also been stated that the identification of the
context in which competencies are to be demonstrated becomes as critical
as the identification of the competencies themselves. Some competencies
may be demonstrated under simulated conditions while others require a
classroom setting.

One aspect which received considerable attention is that of sampling,
and thus the relationship of an individual's performance at a given point
in time to his actual ability to demonstrate a competency should he so
choose. The predictive relationship between performance and competence
is affected by the adequacy of sampling.

Perhaps the most widely used method of assessing teacher performance
is subjective rating where an observer evaluates the candidate through 
observation and possibly through the use of some type of checklist. 
One method of assessing teacher performance that has received considerable 
attention is the use of systematic evaluation techniques. The importance 
of the specificity of the competency statement is evident in the use of 
systematic observation techniques. The more specific the competency 
statement, the lower the inference level in arriving at evaluation decisions. 

Two important considerations in the use of systematic observation scales
are validity and reliability. Content, concurrent, and construct validity
are areas which must be accounted for. Reliability has been identified
as the essential element in the use of a measurement device, and a variety
of reliability perspectives have been described. Coefficients of observer
agreement and analysis of variance have been used to determine reliability.
Three aspects of reliability are the reliability coefficient, the coefficient
of observer agreement, and the stability coefficient. Other problems for
consideration are standardized conditions of observations, deleterious
effects on teacher behavior, and fakeability under such conditions.

Simulation is one approach that has been suggested and tried in various
ways. Many extraneous variables are controlled in such situations, but
there is a concomitant loss of test fidelity, although simulation is much
more realistic than paper-and-pencil testing.

The area of attitudes is difficult to assess and is usually not subject
to formal evaluation in teacher education programs. A distinction has
been made between personality characteristics and affective competencies.
Approaches to assessment of personality are primarily projective techniques.
Instruments which have been utilized more frequently are the Minnesota
Teacher Attitude Inventory, the California F Scale, and the MMPI. The
fakeability of such tests has been noted as a potential source of error.
Among the devices used for assessment of affective teaching competencies
are systematic observation techniques, self-response questionnaires,
Q-sort techniques, the semantic differential, and rating scales.

A major concern implied throughout the paper is the need to examine the
feasibility of assessment of a given domain prior to making a decision as
to whether or not competencies should be written for that area and in
what form. Although assessment in the attitude domain is faced with a
variety of problems, it would be dangerous for a program to exclude
competencies in this area because they cannot be readily assessed.

Consequence objectives require the teacher trainee to produce changes
in the student, usually achievement gains, although the activity a
student engages in is another possible criterion. In evaluating activities
students engage in, a number of problems are encountered, such as the
causal relationship between teacher performance and student activities,
observation problems such as halo effect, sampling concerns, and others.

Evaluating teacher performance utilizing student achievement also has a
number of serious problems. Some research indicated that a large portion
of the variance in student ability and achievement is attributable to early
environmental factors. Other concerns are the ultimacy of the criteria,
adequacy of measures for assessing pupil gains at different levels and
in different areas, reliability of gain scores, and regression effects. It
has been concluded that student learning measures cannot be fairly used
to evaluate individual teachers at present.

Expressive objectives do not have pre-determined outcomes; they require
only the experiencing of certain activities. Instruments used in this
domain include checklists, descriptive reports, anecdotal records, etc.
Since this domain requires little data, it is the easiest to "assess" but
also yields information of a less rigorous nature.

Epilogue

In analyzing assessment problems related to teacher competencies, the
author has attempted to synthesize the diverse opinions on a variety
of assessment concerns found in the educational literature. There may be
some areas of importance, however, which have been omitted or have not
been given appropriate depth of treatment. It is also possible that
conflicting or alternative viewpoints on certain aspects have not been
presented. The author is interested in any information which would
clarify or otherwise contribute to this paper, and would welcome
readers' comments.